This lab focuses on data modelling using $k$ nearest neighbours regression. It's a direct counterpart to the linear regression from Lab 06 and the decision tree regression in Lab 07b. At the end of the lab, you should be able to use scikit-learn to:
build k nearest neighbours regression models;
standardise features as part of a modelling pipeline;
select model hyperparameters using a grid search with cross validation;
measure the accuracy of a regression model.
Let's start by importing the packages we'll need. As in Lab 07a, we're going to use the neighbors subpackage from scikit-learn to build k nearest neighbours models.
In [ ]:
%matplotlib inline
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
Next, let's load the data. This week, we're going to load the Auto MPG data set, which is available online at the UC Irvine Machine Learning Repository. The data set is in fixed width format, but fortunately this is supported out of the box by pandas' read_fwf function:
In [ ]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
df = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model year', 'origin', 'car name'])
According to its documentation, the Auto MPG data set consists of eight explanatory variables (i.e. features), each describing a single car model, which are related to the given target variable: the number of miles per gallon (MPG) of fuel of the given car. The following attribute information is given: mpg (continuous; the target), cylinders (multi-valued discrete), displacement (continuous), horsepower (continuous), weight (continuous), acceleration (continuous), model year (multi-valued discrete), origin (multi-valued discrete) and car name (string, unique for each instance).
Let's start by taking a quick peek at the data:
In [ ]:
df.head()
As the car name is unique for each instance (according to the data set documentation), it cannot be used to predict the MPG by itself, so let's drop it as a feature and use it as the index instead:
Note: It seems plausible that MPG efficiency might vary from manufacturer to manufacturer, so we could generate a new feature by converting the car names into manufacturer names (a rough sketch of this idea is shown below), but for simplicity let's just drop them here.
In [ ]:
df = df.set_index('car name')
df.head()
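For reference, here's a rough sketch of how such a manufacturer feature might be derived, assuming the first word of each car name is the manufacturer; this is an assumption on our part, and in practice the names would also need cleaning (misspellings, aliases and so on):
In [ ]:
# Hypothetical sketch only: take the first whitespace-separated token of each
# car name (now the index) as the manufacturer.
manufacturer = df.index.to_series().str.split().str[0]
manufacturer.value_counts().head()
# We could then one hot encode it, e.g.:
# df = pd.get_dummies(df.assign(manufacturer=manufacturer), columns=['manufacturer'])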
According to the documentation, the horsepower column contains a small number of missing values, each of which is denoted by the string '?'. Again, for simplicity, let's just drop these from the data set:
In [ ]:
df = df[df['horsepower'] != '?']
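As an aside, an equivalent approach would be to let pandas treat the '?' markers as missing values and then drop the affected rows. The sketch below demonstrates this on a throwaway copy of the data so that the lab's own steps are unaffected:
In [ ]:
# Aside (illustration only): coerce non-numeric strings to NaN, then drop them.
alt = pd.read_fwf(url, header=None, names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                                           'acceleration', 'model year', 'origin', 'car name'])
alt['horsepower'] = pd.to_numeric(alt['horsepower'], errors='coerce')  # '?' becomes NaN
alt = alt.dropna(subset=['horsepower'])
alt.shape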
Usually, pandas is smart enough to recognise that a column is numeric and will convert it to the appropriate data type automatically. However, in this case, because there were strings present initially, the data type of the horsepower column isn't numeric:
In [ ]:
df.dtypes
We can correct this by converting the column values to numbers manually, using pandas' to_numeric function:
In [ ]:
df['horsepower'] = pd.to_numeric(df['horsepower'])
# Check the data types again
df.dtypes
As can be seen, the data type of the horsepower column is now float64, i.e. a 64 bit floating point value.
According to the documentation, the origin variable is categorical (i.e. origin = 1 is not "less than" origin = 2), and so we should encode it via one hot encoding so that our model can make sense of it. This is easy with pandas: all we need to do is use the get_dummies function, as follows:
In [ ]:
df = pd.get_dummies(df, columns=['origin'])
df.head()
As can be seen, one hot encoding converts the origin column into separate binary columns, each representing the presence or absence of the given category. Because we're going to use a nearest neighbours regression model, we don't need to worry about the effects of multicollinearity, and so there's no need to drop one of the encoded variable columns as we did in the case of linear regression.
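As a quick illustration of what we did in the linear regression case, get_dummies can drop one of the encoded columns via its drop_first argument. The toy data frame below is purely for demonstration and isn't part of the lab's workflow:
In [ ]:
# Toy example only: drop_first=True removes one dummy level per encoded column,
# which avoids perfect multicollinearity for models like linear regression.
demo = pd.DataFrame({'origin': [1, 2, 3, 1]})
pd.get_dummies(demo, columns=['origin'], drop_first=True)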
Next, let's take a look at the distribution of the variables in the data frame. We can start by computing some descriptive statistics:
In [ ]:
df.describe()
Next, let's print a matrix of pairwise Pearson correlation values:
In [ ]:
df.corr()
Let's also create a scatter plot matrix:
In [ ]:
pd.plotting.scatter_matrix(df, s=50, hist_kwds={'bins': 10}, figsize=(16, 16));
Based on the above information, we can conclude the following: MPG is strongly negatively correlated with the cylinders, displacement, horsepower and weight variables; several of these explanatory variables are also strongly correlated with one another; and some of the relationships with MPG appear to be non-linear. For now, we'll just note this information, but we'll come back to it later when improving our model.
Let's build a nearest neighbours regression model to predict the MPG of a car based on its other attributes. scikit-learn supports nearest neighbours functionality via the neighbors subpackage. This subpackage supports both nearest neighbours regression and classification. We can use the KNeighborsRegressor class to build our model.
KNeighborsRegressor accepts a number of different hyperparameters, and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params method of the estimator (this works on any scikit-learn estimator), like this:
In [ ]:
KNeighborsRegressor().get_params()
You can find a more detailed description of each parameter in the scikit-learn documentation.
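Before building the full pipeline, here's a minimal sketch of the basic fit/predict API, using a couple of arbitrarily chosen numeric features and the default hyperparameters; it isn't part of the lab's modelling workflow:
In [ ]:
# Minimal sketch only: fit a default k nearest neighbours regressor on two
# arbitrarily chosen features and predict the MPG for the first few rows.
knn = KNeighborsRegressor()
knn.fit(df[['weight', 'horsepower']], df['mpg'])
knn.predict(df[['weight', 'horsepower']].head())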
As we are dealing with several features of different scales, we should rescale our features before fitting the model, as the distance measures used by nearest neighbours can be sensitive to this. One way to do this is to standardise the features prior to fitting the model using the StandardScaler class. As with the classification example, we can create a pipeline to capture the series of transformation operations we want to apply before fitting the model, as shown below.
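To see what standardisation does, here's a small illustration on two of the numeric columns (chosen arbitrarily); each column is rescaled to have zero mean and unit variance:
In [ ]:
# Illustration only: StandardScaler rescales each column to zero mean and
# unit standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['weight', 'acceleration']])
print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]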
Let's use a grid search to select the optimal nearest neighbours regression model from a set of candidates. As before, we first define the parameter grid; we then use a grid search with an inner cross validation to select the best model, and an outer cross validation to measure the accuracy of the selected model.
In [ ]:
X = df.drop('mpg', axis='columns') # X = features
y = df['mpg'] # y = prediction target
pipeline = make_pipeline(
StandardScaler(),
KNeighborsRegressor()
)
# Build models for different values of n_neighbors (k), distance metric and weight scheme
parameters = {
'kneighborsregressor__n_neighbors': [2, 5, 10, 15, 20],
'kneighborsregressor__metric': ['manhattan', 'euclidean'],
'kneighborsregressor__weights': ['uniform', 'distance']
}
# Use inner CV to select the best model
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
# Print the results
print('Mean absolute error: %f' % mean_absolute_error(y, y_pred))
print('Standard deviation of the error: %f' % (y - y_pred).std())
ax = (y - y_pred).hist()
ax.set(
title='Distribution of errors for the nearest neighbours regression model',
xlabel='Error'
);
Our nearest neighbours regression model predicts the MPG with an average error of approximately ±1.99, with a standard deviation of 2.85, which is better than our final linear regression model from Lab 06 and comparable to our random forest regression model from Lab 07b. It's also worth noting that we were able to achieve this level of accuracy with very little feature engineering effort (albeit a little more than with decision tree regression). This is because the nearest neighbours algorithm does not rely on the same set of assumptions (e.g. linearity) as linear regression, and so is able to learn from the data with less manual tuning.
We can check the parameters that led to the best model via the best_params_ attribute of the output of our grid search, as follows:
In [ ]:
clf.best_params_
Further improvements may be possible by expanding the ranges of values in the parameter grid.
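For example, a broader grid might look like the sketch below; the particular values are illustrative rather than prescriptive, and a larger grid will make the search take longer:
In [ ]:
# Hypothetical expanded grid: more values of k and an extra distance metric.
parameters = {
    'kneighborsregressor__n_neighbors': list(range(1, 26)),
    'kneighborsregressor__metric': ['manhattan', 'euclidean', 'chebyshev'],
    'kneighborsregressor__weights': ['uniform', 'distance']
}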